Page Digest for Large-Scale Web Services

Authors

  • Daniel Rocco
  • David Buttler
  • Ling Liu
Abstract

The rapid growth of the World Wide Web and the Internet has fueled interest in Web services and the Semantic Web, which are quickly becoming important parts of modern electronic commerce systems. An interesting segment of the Web services domain is the set of facilities for document manipulation, including Web search, information monitoring, data extraction, and page comparison. These services are built on common functional components that can preprocess large numbers of Web pages, parsing them into internal storage and processing formats. If a Web service is to operate on the scale of the Web, it must handle this storage and processing efficiently. In this paper, we introduce Page Digest, a mechanism for efficient storage and processing of Web documents. The Page Digest design encourages a clean separation of the structural elements of Web documents from their content. Its encoding transformation produces many of the advantages of traditional string digest schemes yet remains invertible without introducing significant additional cost or complexity. Our experimental results show that the Page Digest encoding can provide at least an order of magnitude speedup when traversing a Web document as compared to using a standard Document Object Model implementation for data management. Similar gains can be realized when comparing two arbitrary Web documents. To understand the benefits of the Page Digest encoding and its impact on the performance of Web services, we examine Sdiff, a Web service for Internet-scale information monitoring. Sdiff leverages the Page Digest encoding to perform structurally aware change detection and comparison of arbitrary Web documents. Our experiments show that change detection using Page Digest operates in linear time, offering a 75% improvement in execution performance compared with existing systems. In addition, the Page Digest encoding can reduce the tag name redundancy found in Web documents, allowing a 30% to 50% reduction in document size, especially for XML documents.
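The abstract does not give the encoding itself, so the following Python sketch is only an illustration of the general idea it describes: separating structure from content, storing each tag name once, and keeping the transformation invertible. The function names `encode` and `decode`, the four-array layout, and the choice to ignore attributes and tail text are all assumptions made for this example, not the Page Digest format as published.

```python
import xml.etree.ElementTree as ET

def encode(xml_text):
    """Split a document into flat structure arrays and a content array.

    Hypothetical layout (an assumption, not the paper's format):
      child_counts[i] - number of children of the i-th node in preorder
      tag_ids[i]      - index of that node's tag name in tag_table
      tag_table       - each distinct tag name, stored once
      texts[i]        - stripped text content of the i-th node
    Attributes and tail text are ignored to keep the sketch short.
    """
    root = ET.fromstring(xml_text)
    tag_table, tag_index = [], {}
    child_counts, tag_ids, texts = [], [], []
    stack = [root]
    while stack:                      # iterative preorder traversal
        node = stack.pop()
        children = list(node)
        if node.tag not in tag_index:
            tag_index[node.tag] = len(tag_table)
            tag_table.append(node.tag)
        child_counts.append(len(children))
        tag_ids.append(tag_index[node.tag])
        texts.append((node.text or "").strip())
        stack.extend(reversed(children))
    return child_counts, tag_ids, tag_table, texts

def decode(child_counts, tag_ids, tag_table, texts):
    """Rebuild the element tree from the flat arrays (the inverse of encode)."""
    pos = 0
    def build():
        nonlocal pos
        i = pos
        pos += 1
        elem = ET.Element(tag_table[tag_ids[i]])
        if texts[i]:
            elem.text = texts[i]
        for _ in range(child_counts[i]):
            elem.append(build())
        return elem
    return build()
```

With this layout, a full traversal is a single pass over `child_counts`, and a tag name that occurs many times in the document is stored only once in `tag_table`, which is the kind of redundancy reduction the abstract mentions.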


Similar resources

Semantic Constraint and QoS-Aware Large-Scale Web Service Composition

Service-oriented architecture facilitates the running time of interactions by using business integration on the networks. Currently, web services are considered as the best option to provide Internet services. Due to an increasing number of Web users and the complexity of users’ queries, simple and atomic services are not able to meet the needs of users; and to provide complex services, it requ...


The London Walkthrough in an Immersive Digital Library Environment

A new approach to browsing digital libraries is presented in this paper. This approach enables users to comprehend and digest large amounts of information easily. Our approach, which extends Shiaw’s research [31], takes advantage of fully immersive Virtual Reality Environments to preserve context even when a user focuses on a single item. Currently, in web-based digital libraries, users can eit...


Integrating Virtual Globes and Web Service Technologies for Higher-education Teaching and Research

The emergence of Virtual Globe software systems offers tremendous opportunities, such as providing a new generation of learning tools to help students digest large-scale geospatial information about the world, and supporting domain expert analyses in an interactive three-dimensional virtual environment. In the meantime, Web Service technologies, especially those standards-based interoperable geo...


Image flip CAPTCHA

The massive and automated access to Web resources through robots has made it essential for Web service providers to make some conclusion about whether the "user" is a human or a robot. A Human Interaction Proof (HIP) like Completely Automated Public Turing test to tell Computers and Humans Apart (CAPTCHA) offers a way to make such a distinction. CAPTCHA is a reverse Turing test used by Web serv...


Class-Oriented Page Invalidation for Caching Dynamic Web Content

Caching dynamic pages at a server is beneficial in reducing server resource demands and it also helps dynamic page caching at proxy sites. Previous work has used fine-grain dependence graphs among individual dynamic pages and underlying data sets to enforce result consistency. Such an approach can be cumbersome or inefficient for a Web site to manage a cache in dealing with an arbitrarily large numb...




Publication date: 2003